September 11, 2018
Give food for thought.
Part 1: A Conceptual Overview
Part 2: Programming Refresher
Amazon’s recommendation engines examine past purchase behavior to personalize and surface products to relevant consumers.
Orbital Insights deploys computer vision algorithms to count cars at retail store parking lots in order for investors to better approximate quarterly earnings.
Quandl used a panel of email receipts to predict the effects of an Uber platform charge policy.
Zillow collected most of the housing sales data in the US and constructed a price prediction model to help sellers price their homes more competitively when they choose to put the property on the market.
The FDNY, among other city fire departments, trains algorithms to predict where fires will occur in order to target fire safety inspections.
Advanced time series modeling can be used to detect anomalous activity so that online platforms can safeguard their assets.
All of these examples have a common structure.
What is the underlying structure?
Data science has a rather fluffy definition as it is an interdisciplinary field. But generally, there is agreement that data science sits at the intersection of mathematical inference, computer science, and subject matter expertise.
Venn diagram by Drew Conway
Data science uses statistical inference and computational algorithms to develop applications that communicate an actionable insight.
\[\text{Data Science} = f(\text{Statistics, Computer Science})\]
Modern statisticians will claim that data science is no different, but they tend not to operationalize insight. Computer scientists have long developed applications, but they are less interested in inference.
\[\text{Statistics} \neq \text{Computer Science} \]
Data science is not social science as it does not rely on formal social theories, but rather starts from the first principles of inference from data.
\[\text{Data Science} \neq \text{Social Science} \]
Data science is also a marketing term and part of a larger universe of buzzwords:
| | Yea | Nay |
|---|---|---|
| Adoption | Embraced mostly in fields where there is a rapid expansion of data and theory has not yet formed | Data science tends to clash with well-established fields. |
| Interpretation | Underlying skills can be flexibly applied | There isn’t a gold standard for how it should be applied. |
| Staffing need | Very few people are required to do the job well | A fluffy definition means little quality control over who counts as a data scientist |
| Long term outlook | Skills will likely persist into the future | Whereas the early days of data science focused on generalist practitioners, future practitioners will be field specific – re-absorbed into their host fields. |
Data science has been hyped. It is easy to get our hands on more data and technology, so many practitioners use the tools like throwing spaghetti at the wall and hoping that something sticks.
Who’s left? Natural and social scientists like economists, who tend to be skeptical. They are good at asking a lot of pointed questions.
Applications tend to take on one of three forms:
What we do with the data needs to have a point.
Models 1 and 2 have been estimated for time series \(y\).
| Story-Driven | Model-Driven |
|---|---|
| A story is derived from the regression coefficients, but the model may have low empirical accuracy. | The number may be very close to reality, but the model does not lend itself to telling a story. |
| Tends to place greater weight on individual variables. | Tends to focus on minimizing error. |
The remainder of today will focus on prediction.
Note: Different companies will have different ways of describing this process – the labels are largely for marketing purposes, but the process is effectively the same.
An association score can rank relevant products to facilitate sales. User purchases are cleaned and aggregated into a user-product matrix, then cosine similarity is calculated to find which products are correlated with which other products.
A predicted price helps set the asking price. Housing sales records are structured into a cross-sectional data set of housing attributes, then a regression is used to correlate house attributes with the price in order to predict prices of unsold houses.
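The regression step above can be sketched with a single attribute. This is a minimal illustration, fitting ordinary least squares by hand on made-up square-footage and price figures (the data and the one-feature setup are assumptions for the example, not real sales records):

```python
# Minimal sketch: simple linear regression of sale price on square
# footage, fit by ordinary least squares (toy data).
def fit_ols(x, y):
    """Return (intercept, slope) minimizing mean squared error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

sqft = [1000, 1500, 2000, 2500]               # housing attribute
price = [200_000, 290_000, 410_000, 500_000]  # observed sale prices

b0, b1 = fit_ols(sqft, price)
predicted = b0 + b1 * 1800   # price estimate for an unsold 1,800 sqft house
```

In practice a production model would use many attributes (bedrooms, location, lot size) and a library fit, but the idea is identical: calibrate coefficients on sold houses, then score unsold ones.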
This sets user expectations and facilitates more targeted search. Job listings with salary data are structured into a cross-sectional data set of job attributes, then a regression is used to correlate job attributes with the salary in order to predict salaries for postings that are missing that information.
Past Uber rides are turned into a time series for every half-mile grid cell in a given city. A time series model such as ARIMA, a neural net, the Theta algorithm, Holt-Winters, or another method is applied to predict the level of expected ridership over the next few hours. This is then used to send alerts that direct drivers to demand hotspots.
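The forecasting step can be illustrated with the simplest member of that model family. The sketch below uses simple exponential smoothing on a made-up hourly ride-count series (the numbers and the choice of smoothing parameter are assumptions; a production pipeline would use ARIMA or Holt-Winters from a library like statsmodels):

```python
# Minimal sketch: one-step-ahead forecast via simple exponential
# smoothing, a stand-in for the richer time series models named above.
def ses_forecast(series, alpha=0.5):
    """Smooth the series and return the one-step-ahead forecast."""
    level = series[0]
    for value in series[1:]:
        # New level = weighted blend of the latest observation and
        # the previous level; alpha controls how fast we adapt.
        level = alpha * value + (1 - alpha) * level
    return level

rides = [120, 130, 125, 140, 150]   # toy rides-per-hour in one grid cell
next_hour = ses_forecast(rides)     # forecast for the coming hour
```

Swapping in a seasonal model would let the forecast capture rush-hour cycles, but the interface is the same: history in, expected level out.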
All of these cases use terabytes to petabytes of information for daily production. Steps 2 through 5 are engineered as a program that can run on its own.
Set the initial starting parameters by answering these questions:
\[ \cos(\theta) = \frac{\sum^n_{i=1}A_iB_i}{\sqrt{\sum^n_{i=1}A_i^2}\sqrt{\sum^n_{i=1}B_i^2}} \]

- What are the success criteria? Whether recommended products were clicked on, and whether recommended products were purchased more than products with lower scores.
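The cosine similarity formula translates directly into code. This is a minimal sketch in which each vector is one product's purchase counts across three users (the vectors are illustrative, not real purchase data):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, per the formula:
    dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy columns of a user-product matrix: purchase counts by three users.
product_a = [1, 0, 2]
product_b = [2, 0, 4]   # bought by the same users, in proportion
product_c = [0, 3, 0]   # bought by a disjoint set of users

print(cosine_similarity(product_a, product_b))  # 1.0 (perfectly aligned)
print(cosine_similarity(product_a, product_c))  # 0.0 (no overlap)
```

A score near 1 means the two products are bought by similar users, which is what makes the metric usable as a recommendation ranking.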
| (1) Contemporaneous Model | (2) Traditional Lag (Momentum) |
|---|---|
| \[y_t = f(x_t)\] | \[y_{t} = f(x_{t-1})\] |
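The difference between the two designs in the table is just how the input rows are aligned with the target. A minimal sketch on toy series (the numbers are illustrative; with real data this is typically a `pandas` `shift`):

```python
# Toy series: x is the predictor, y is the target, index = time.
x = [10, 12, 11, 13]
y = [100, 120, 110, 130]

# Model (1): pair each y_t with the same period's x_t.
contemporaneous = list(zip(x, y))

# Model (2): pair each y_t with the previous period's x_{t-1};
# the first observation of y has no lagged x, so it is dropped.
lagged = list(zip(x[:-1], y[1:]))
```

The lagged design is what allows true forecasting: at prediction time, x from the prior period is already observed.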
There are very few rules, except that the data must reflect the state of the world at the time of each estimation cycle so that we can simulate forecasting.
Modeling is typically split between model training and model validation.
Given a target \(y\) and inputs \(X\), training means calibrating a machine learning algorithm to mimic \(y\) subject to a loss function.
The goal is to find the model that produces the lowest loss, i.e., the highest accuracy.
Even regression (\(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)) can be framed as an iterative process minimizing the mean squared error.
Modeling in the social sciences tends to focus on parametric methods (e.g. linear regression) as they lend themselves to a story. In data science, the goal is predicting the number, so the methods are far more diverse and more complex.
| Common Econ Methods | Data Science Methods |
|---|---|
| - Linear regression | - Linear regression |
| - Stepwise regression | - Stepwise regression |
| - ARIMA | - ARIMA |
| - Bayesian Vector Autoregression | - Bayesian Vector Autoregression |
| - Rolling Average | - Regularized regression (LASSO, Ridge, Elastic Net) |
| | - Neural Networks, LSTM |
| | - Support Vector Machines (SVM, SVR) |
| | - Random Forests |
| | - Boosting (Adaptive, Extreme) |
| | - Ensemble averaging and stacking |
| | - 1000’s more |
When producing predictions, the data are usually split into a training and testing set.
Remember, the goal is to find the model with the lowest loss. As the model iteratively learns from the data, it will have seen the same data multiple times, so scoring it on that data is like allowing a student to cheat off of every other student in the class.
When students learn the test answers but not the concept, it is similar to models learning the data values but not the underlying patterns. This is called overfitting.
Partitioning the data involves splitting it into a training set for the student (the model) to learn from, then testing it on a previously unseen scenario.
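The partitioning step is a one-liner in most libraries; a minimal sketch with the standard library, using stand-in row indices and an assumed 75/25 split:

```python
import random

# Minimal sketch: hold out 25% of the rows so the model is scored
# only on observations it never saw during training.
rows = list(range(100))   # stand-in for 100 observations
random.seed(42)           # fix the seed so the split is reproducible
random.shuffle(rows)      # shuffle before splitting to avoid ordering bias

split = int(len(rows) * 0.75)
train, test = rows[:split], rows[split:]
```

The model is fit on `train` and evaluated on `test`; a large gap between training accuracy and test accuracy is the telltale sign of overfitting.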
Would you limbo twice?